Abstract


A measure of cognitive workload was needed to conduct human factors
evaluations  of  telemedicine  applications.  A  literature  review  was 
conducted  to  find  available  metrics  and  select  candidates  for  testing.  Three 
candidate  measures  (the  Subjective  Workload  Assessment  Technique 
[SWAT],  NASA-Task  Load  Index  [TLX]  along  with  its  subscales,  and  the 
Modified Cooper-Harper [MCH]) were selected using the following
criteria:  reliability,  validity,  lack  of  contamination,  availability,  sensitivity, 
lack  of  intrusiveness,  diagnosticity,  and  cost.  All  metrics  in  the  literature 
review,  as  well  as  the  application  of  the  selection  criteria,  are  described  in 
this  report.  Methodological  development  and  research  were  then 
completed  to  develop  a  research  paradigm  for  selecting  the  best  workload 
metric  from  the  three  candidates.  This  effort  included  the  development 
and  norming  of  difficulty  levels  of  a  surrogate  task  in  a  controlled 
experimental  protocol,  the  selection  of  a  spatial  abilities  test,  acquisition 
and  testing  of  required  telecommunication  and  recording  equipment,  and 
the  iterative  development  and  testing  of  a  research  protocol.  These 
processes and their results are described in detail in this report.

SELECTION OF A WORKLOAD METRIC FOR EVALUATION OF TELEMEDICINE
APPLICATIONS:  LITERATURE  REVIEW  AND  METHODOLOGICAL  DEVELOPMENT 


INTRODUCTION 

Telemedicine  literally  means  medicine  at  a  distance.  Presently,  telemedicine  has  been 
defined  as  the  use  of  telecommunications  and  information  technologies  to  provide  health  care. 
This  encompasses  the  diagnosis,  treatment,  monitoring,  and  education  of  patients  regardless  of 
the  patient,  provider,  or  information  location  (Puskin,  Brink,  Mintzer,  &  Wasem,  1995). 

There  have  been  a  number  of  efforts  to  use  telemedicine  to  deliver  health  care  to  remote 
and  medically  underserved  populations  over  the  last  40  years.  A  review  of  the  telemedicine 
programs  during  this  time,  however,  revealed  that  only  one  major  project  continued  to  survive 
after  the  withdrawal  of  external  funding  (Hassel,  1995).  The  reasons  for  the  lack  of  success  of 
these  telemedicine  efforts  are  not  apparent.  This  is  in  large  part  because  few,  if  any,  rigorous 
scientific  evaluations  were  done. 

The  problem  of  evaluating  telemedicine  applications  has  recently  been  recognized  and 
addressed  by  a  number  of  researchers  and  policy  makers  in  the  area  (Bashshur,  1995;  Grigsby, 
Schlenker,  Kaehny,  Shaughnessy,  &  Sandberg,  1995;  Puskin  et  al.,  1995).  In  particular,  the 
Department  of  Defense  (DoD)  Telemedicine  Evaluation  Working  Group  (TEWG)  proposed  a 
conceptual  framework  to  guide  the  development  of  methodologies  to  evaluate  telemedicine 
projects  in  the  DoD.  The  five  areas  to  be  evaluated  in  the  TEWG  framework  are  clinical 
outcomes,  patient-provider  satisfaction,  human  factors,  organizational  impact,  and  costs  and 
benefits. 

One  of  the  areas  in  the  human  factors  evaluation  that  was  determined  to  be  important  was 
the  assessment  of  workload.  It  has  been  shown  in  other  areas  (e.g.,  aviation)  that  changes  in 
technological  applications  have  resulted  in  additional  workload  demands  on  the  operator.  This 
additional  workload  has  been  related  to  decrements  in  performance.  It  is  believed  that  a  similar 
change  in  the  behavioral  and  cognitive  workload  of  the  health  care  provider  may  occur  as  a  result 
of  the  additional  requirements  imposed  by  telemedicine  applications.  This  change  in  workload 
may  result  in  an  increase  in  the  number  of  errors  committed. 

Consequently,  a  review  of  the  cognitive  workload  literature  was  completed  to  identify  the 
most  promising  workload  metrics  for  possible  use  in  measuring  changes  in  workload  in 
telemedicine applications. The literature review and the selection of the candidate measures are
described  in  the  next  section  of  this  report. 

It  is  necessary  to  subject  these  candidate  measures  to  empirical  verification  and  validation 
for  use  in  evaluating  telemedicine  applications.  To  maintain  experimental  control,  as  well  as  for 
legal  and  ethical  reasons,  the  original  verification  and  validation  process  must  be  conducted  in  a 
laboratory  before  being  used  in  evaluating  actual  telemedicine  applications.  Therefore,  a 
surrogate  laboratory  task,  which  taps  the  same  cognitive  demands  as  expected  in  telemedicine 
applications,  and  a  laboratory  protocol  for  testing  workload  metrics  were  developed.  This  task 
and  protocol  development  are  described  in  the  present  paper. 

The  work  reported  here  will  serve  as  the  basis  for  further  development  of  a  methodology 
for  evaluating  workload  in  telemedicine  applications.  The  potential  metrics  need  to  be  verified 
and validated with more appropriate populations. Following that work, it should be possible to
extend  the  findings  of  this  research  to  the  evaluation  of  actual  telemedicine  applications. 

LITERATURE  REVIEW 
Database 

A  detailed  analysis  of  subjective  workload  metrics  was  used  to  select  the  metrics  that 
hold  the  most  promise  for  use  in  telemedicine.  This  analysis  examined  human  factors  technical 
and psychology electronic databases using the terms workload and subjective; cognitive
workload and subjective; mental workload and subjective; plus several specific metric names:
overall workload; OW; SWAT; NASA-TLX; TLX; Cooper-Harper; MCH. From the titles
accessed by this search, three comprehensive reviews, three meta-analyses, and 44 articles
describing experimental results were selected as a comprehensive information set upon which to
base our metric selection decisions.

Workload  Measurement  Classification 

Cognitive workload measures can be classified into three broad areas: physiological,
performance  (primary  task  or  loading  task),  and  subjective  (Schlegel,  1993). 

Physiological  Measures 

The  human  body  responds  both  cognitively  and  physiologically  to  the  demands  of 
its environment and its tasks. Physiological measures that vary with cognitive demands have
been  tested  as  potential  metrics  of  those  cognitive  demands  (Wierwille  &  Eggemeier,  1993). 
These  measures  include  eye  blink  rate,  pupil  diameter,  P300  amplitude  and  latency,  galvanic  skin 
response  (GSR),  heart  rate,  heart  rate  variability,  and  certain  blood  and  urine  fractions  (e.g., 
norepinephrine).  It  is  difficult  to  measure  physiological  responses  because  of  the  large  number 
of  trials  that  must  be  performed  to  obtain  reliable  measures  and  because  of  the  invasive  nature  of 
most  of  the  measurement  technology.  Finally,  those  measures  that  have  been  obtained  often  do 
not  agree  with  one  another  (i.e.,  a  task  demand  may  be  reflected  in  heart  rate  variability  but  not 
in  GSR)  and  do  not  consistently  occur  in  the  literature. 

Performance  Measures 

Performance  measures  for  the  primary  task  are  the  most  direct  indication  of 
changing  cognitive  workload  (Crabtree,  Bateman,  &  Acton,  1984).  When  a  task  requires 
primarily  cognitive  effort,  changes  in  that  task's  performance  might  be  thought  to  provide  the 
best  indication  of  changes  in  the  level  of  cognitive  effort.  However,  this  will  only  prove  to  be 
the case if the performance is sensitive to these changes in workload (Boff & Lincoln, 1988).
Instead,  suppose  that  a  person  can  perform  a  task  with  low  workload  demands  without  error  and 
without  employing  the  maximum  cognitive  resources  to  complete  the  task.  Then  suppose  that 
the  demands  of  the  task  are  increased;  now  the  worker  can  continue  to  perform  the  task  without 
error only by expending all his or her cognitive resources; that is, he or she has no spare
resources  but  is  able  to  maintain  flawless  performance.  In  this  way,  performance  is  not  a 
sensitive  indicator  of  the  changes  in  cognitive  workload. 

A  reasonable  question  to  ask  is  why  one  would  care  about  changes  in  workload 
that  do  not  affect  the  performance  of  the  task  of  interest.  When  a  task  is  completed  during 
testing  conditions,  we  usually  find  that  the  operator  is  rested,  the  communication  among  team 
members  is  perfect,  the  time  on  task  was  limited,  and  no  emergencies  occurred.  In  these 
circumstances,  task  performance  may  not  be  a  sensitive  indicator  of  how  close  an  operator  is  to 
using  all  available  resources.  However,  whenever  any  one  of  these  circumstances  is 
compromised,  as  they  often  are  during  actual  operating  conditions,  then  the  operator  using  all  his 
or  her  cognitive  resources  to  maintain  flawless  performance  during  optimal  circumstances  will 
be  overloaded  and  begin  to  make  errors.  In  contrast,  the  operator  completing  a  task  with  a  lower 
workload  will  have  an  available  cognitive  reserve  to  muster  in  the  face  of  adverse  circumstances. 
This  is  the  reason  that  a  sensitive  measure  of  workload  may  provide  a  better  predictor  of 
operational performance than could tested performance itself.

One means of improving the sensitivity of performance measures is to add an
additional  task  that  will  use  all  available  cognitive  resources  even  during  normal  testing 
conditions  (Fisk,  Derrick,  &  Schneider,  1983).  This  procedure  requires  that  the  operator 
complete  two  tasks  concurrently;  one  is  the  task  of  interest  (the  primary  task)  and  the  second  is  a 
loading  task  used  to  push  the  demands  on  the  operator's  resources,  even  during  the  lightest 
primary  task  workload  conditions.  This  is  known  as  a  dual  task  paradigm.  The  result  is  that 
operator  performance  of  the  combination  of  tasks  demands  all  cognitive  resources  at  each  level 
of  primary  task  workload.  In  this  way,  changes  in  that  workload  will  be  accurately  reflected  in 
changes  in  performance  of  one  or  both  of  the  concurrent  tasks.  In  effect,  the  loading  task  is 
acting  in  much  the  same  way  that  the  adverse  circumstances  and  emergency  demands  of  the 
operational  setting  affect  cognitive  demands  and  in  turn,  adversely  affect  task  performance. 
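The argument above can be illustrated with a toy model. The sketch below is not taken from the report; it assumes a single pool of cognitive resources with a fixed, hypothetical capacity and treats any demand above that capacity as errors. The names `CAPACITY` and `errors` and all numeric values are illustrative assumptions.

```python
# Toy single-resource model (an illustrative assumption, not the report's method).
CAPACITY = 10.0  # hypothetical total cognitive resources, arbitrary units


def errors(primary_demand, loading_demand=0.0):
    """Return the demand that exceeds capacity, used as a stand-in for errors."""
    total = primary_demand + loading_demand
    return max(0.0, total - CAPACITY)


primary_levels = [4.0, 6.0, 8.0]  # three hypothetical primary task difficulty levels

# Primary task alone: demand never exceeds capacity, so performance stays
# flawless and reveals nothing about the rising workload.
single_task = [errors(d) for d in primary_levels]

# Dual task: a loading task of 6 units uses up the spare capacity, so each
# increase in primary task demand now appears as a change in performance.
dual_task = [errors(d, loading_demand=6.0) for d in primary_levels]

print(single_task)  # [0.0, 0.0, 0.0]
print(dual_task)    # [0.0, 2.0, 4.0]
```

Under these assumptions, the primary task alone shows no performance change across difficulty levels, whereas the dual task condition separates them.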

Subjective  Measures 

Operators  are  capable  of  describing  the  difficulty  of  a  task.  Various  measurement 
instruments  have  been  designed  to  quantify  these  difficulty  evaluations  (Gopher  &  Donchin, 
1986).  These  are  known  as  subjective  measures  of  workload.  Because  the  cognitive  workload 
involved  in  the  completion  of  many  tasks  is  the  conscious  work  that  occurs  in  working  memory, 
that  is,  short  term  memory,  the  operators  themselves  are  able  to  report  the  amount  of  cognitive 
effort  expended.  Hence,  numerous  publications  over  the  past  20  years  have  reported  the 
effectiveness  of  subjective  workload  metrics  in  assessing  cognitive  workload.  In  addition  to 
their sensitivity and implied reliability, these measures have face validity and have shown
validity when compared with task performance (Eggemeier, McGhee, & Reid, 1983; Boyd,
1983). They are relatively inexpensive to collect and are usually nonintrusive on the task itself.
That  is,  the  subjective  workload  measure  can  be  collected  without  interfering  with  task 
performance  (Eggemeier,  Melville,  &  Crabtree,  1984). 

Criteria  for  Selection 

The  goal  for  the  present  review  is  to  determine  candidate  workload  measures  for  the 
assessment  of  cognitive  workload  in  telemedicine  applications.  To  be  useful,  any  measurement 
must  meet  four  criteria:  reliability,  validity,  lack  of  contamination,  and  availability.  Successful 
workload metrics should meet four additional (and not strictly independent) criteria: sensitivity,
nonintrusiveness, diagnosticity, and cost effectiveness. Therefore, each candidate class of measures
and  each  measure  itself  will  be  evaluated  for  these  criteria. 

The criteria are defined and described in the following section:

1. Reliability is the repeatability of a measure. When a measure is reliable, then repeated
occasions,  similar  tasks,  or  judges  will  obtain  similar  measurement  levels.  Without  reliability,  a 
measure  cannot  be  sensitive  or  valid.  Therefore,  finding  validity  and  sensitivity  implies  that 
reliability  exists;  however,  it  is  far  better  to  assess  reliability  directly,  although  this  is  too  seldom 
done  in  operational  settings  (Lysaght  et  al.,  1989). 


2.  Validity  is  the  degree  to  which  a  metric  actually  measures  the  concept  it  is  intended  to 
measure.  For  example,  an  intelligence  test  is  valid  if  it  measures  abilities  as  opposed  to 
measuring  achievement. 

3.  Contamination  occurs  when  a  metric  is  confounded  with  other  influences,  unrelated  to 
the  measurement  of  interest.  For  example,  contamination  in  workload  measures  would  occur 
when  physical  effort  to  complete  the  workload  assessment  confounds  the  measurement  of 
cognitive  workload  for  the  task  per  se.  Lack  of  contamination  is  important  to  any  satisfactory 
metric. 


4. Availability indicates the ability to obtain the measurement. Availability may be
limited by access, funding, or intrusiveness into the task domain itself.

5.  Sensitivity  is  the  extent  to  which  changes  in  the  item  to  be  measured  are  reflected  by 
changes  in  the  measuring  instrument.  Lack  of  sensitivity  will  decrease  both  reliability  and 
validity.  An  example  of  an  insensitive  workload  measure  was  given  earlier  in  the  form  of  some 
primary  task  performance  measurements. 

6.  Intrusiveness  means  the  extent  to  which  performance  of  the  primary  task  is  interrupted 
by  the  workload  metric.  Any  concurrent  demands  for  obtaining  the  measurement  of  workload 
have  the  potential  to  intrude  on  the  primary  task,  but  not  all  appear  to  do  so.  Nonintrusiveness  is 
an  important  criterion  of  a  useful  workload  metric. 

7.  Diagnosticity  refers  to  the  ability  of  a  metric  to  determine  what  aspect  of  the  task  is  the 
source  of  the  imposed  workload,  that  is,  what  operator  resource  is  more  severely  taxed  (see 
Polzella & Reid, 1987, and Vidulich & Wickens, 1986, for contrasting views). If an
unacceptably high workload is found, then a diagnostic metric will pinpoint the cause of that
overload.

8. Cost must be evaluated against the value obtained from knowing the workload
information.  The  relationship  between  the  value  of  the  workload  information  obtained  and  the 
cost  of  obtaining  it  is  the  cost  effectiveness  of  the  metric. 


Application  of  Criteria  to  Workload  Metrics 

These  evaluation  criteria  can  be  applied  to  each  of  the  three  broad  classifications  of 
workload  metrics:  physiological,  performance  (primary  and  dual),  and  subjective. 

Physiological  measures  have  lacked  reliability  during  similar  test-retest  conditions. 
Furthermore,  when  multiple  physiological  measures  are  obtained,  they  often  do  not  correlate 
with  one  another  in  reflecting  changes  in  cognitive  workload.  Without  reliability,  validity  is  not 
possible;  therefore,  the  question  of  validity  can  only  be  considered  when  a  physiological  measure 
has  been  reliable.  Physiological  measures  are  frequently  contaminated  by  artifacts  from  other 
physiological  activities  (e.g.,  eye  blinks,  breathing,  or  muscle  movements).  Although  some 
physiological  measures  can  be  obtained  directly,  most  interest  in  the  assessment  of  cognitive 
workload  (e.g.,  P300  evoked  brain  potentials)  requires  the  use  of  high  technology  equipment  to 
measure small electrical impulses, separate them from surrounding signals, and analyze them
statistically. These measures have proven sensitive in some cases, but often they have not.
A specific application of P300 in the measurement of perceptual workload has been
found  when  using  a  secondary  task  to  elicit  the  P300.  In  this  case,  some  diagnosticity  was  found 
(Gopher  &  Donchin,  1986).  Finally,  the  need  for  equipment  attached  to  the  operator  results  in 
very  intrusive  and  expensive  measurement  methodology. 

Performance  measures  might  be  thought  to  be  reliable  and  valid  measures  of  cognitive 
workload  solely  by  their  definition.  This  statement  assumes  that  performance  results  from 
cognitive  workload  alone.  However,  especially  when  using  only  a  primary  task,  this  has  not 
always  been  the  case.  Employing  a  second  loading  task  has  improved  the  sensitivity  of 
performance  as  an  indicator  of  cognitive  workload.  Unfortunately,  the  use  of  dual  task 
paradigms  may  result  in  decrements  in  the  primary  task  or  the  loading  task  or  both,  as  workload 
increases.  This  may  compromise  the  safety  of  the  primary  task  in  an  operational  setting,  and 
even  in  an  experimental  setting,  it  makes  interpretation  of  the  results  difficult.  The  only  sources 
of  contamination  that  have  been  reported  are  the  cross  linking  of  demands  from  the  two 
concurrent  tasks.  Intrusion  from  the  loading  task  can  be  alleviated  by  careful  selection  of  the 
loading  task  itself.  One  successful  method  has  been  to  develop  imbedded  secondary  tasks 
specific to each type of primary task being evaluated. The costs of obtaining performance
measures,  whether  primary  or  loading  task  performance,  are  moderate. 

Reliable  subjective  measures  have  been  developed  (e.g.,  SWAT  and  NASA-TLX).  This 
cannot  be  claimed  for  all  subjective  workload  measures  that  have  been  employed  (Gopher  & 
Donchin,  1986).  Furthermore,  cluster  analyses  (Derrick,  1983)  have  confirmed  that  these 
measures  are  valid  in  assessing  a  variety  of  the  cognitive  demands  that  impact  workload.  These 
measures can be easily contaminated by experimenter expectations and operator motivation.
Care must be taken to avoid these problems when using subjective workload measures, and the
procedures  for  administering  the  well-developed  metrics  have  taken  these  precautions.  Standard 
metrics  for  assessing  subjective  workload  have  been  established  for  other  domains  such  as  flight 
and  communication,  but  they  have  not  been  employed  in  telemedicine  applications.  The 
sensitivity  of  some  metrics  has  accurately  reflected  changes  in  cognitive  workload  demands  (e.g., 
signal rate, short term memory, and auditory communication requirements) (Eggemeier,
Crabtree, & LaPointe, 1983; Moroney, Biers, & Eggemeier, 1995). These metrics can be
collected  after  the  primary  task  is  completed,  and  hence,  they  are  nonintrusive;  their  cost  is  low. 
See  Table  1  for  a  summary  of  this  analysis. 

In  the  initial  analysis  of  the  three  broad  classes  used  to  measure  cognitive  workload,  the 
category  of  subjective  workload  metrics  is  the  most  satisfactory  when  evaluated  by  these  test  and 
evaluation  criteria.  They  meet  the  standards  of  reliability,  validity,  and  lack  of  contamination. 
Several  metrics  have  been  standardized  and  have  been  tested  in  other  domains.  In  these 
domains,  such  metrics  have  been  sensitive  indicators  of  workload,  as  well  as  predictors  of  task 
performance.  In  general,  subjective  metrics  are  not  thought  to  be  global  indicators  of  workload; 
they  are  not  particularly  diagnostic  of  the  source  of  this  overload.  They  are  the  least  expensive  of 
all  metrics  (other  than  observing  primary  task  performance  alone).  The  nonintrusive  nature  of 
subjective  workload  measures  is  a  very  important  criterion  for  their  use  in  the  operational 
settings of telemedicine practices.

Table 1

Evaluation of Broad Workload Classifications

                                   Workload classification
                                       Performance
Criterion        Physiological     Primary      Loading      Subjective

Reliability      Poor              Good         Good         Generally good
Validity         Variable          Variable     Good         Good
Contamination    Variable          Variable     Variable     Good
Availability     Poor              Good         Variable     Good
Sensitivity      Variable          Variable     Good         Good
Intrusive        Poor              Good         Variable     Good
Diagnostic       Good (P300)       Poor         Good         Poor
Cost             Poor              Good         Moderate     Good
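
The comparison in Table 1 can also be tallied mechanically. The sketch below transcribes the ratings from Table 1 and counts how many criteria each class rates as some form of "good". The tally rule, and the names `TABLE_1` and `count_good`, are illustrative assumptions and are not the selection procedure used in this report.

```python
# Ratings transcribed from Table 1; the tally rule is only an illustration.
CRITERIA = ["Reliability", "Validity", "Contamination", "Availability",
            "Sensitivity", "Intrusive", "Diagnostic", "Cost"]

TABLE_1 = {
    "Physiological": ["Poor", "Variable", "Variable", "Poor",
                      "Variable", "Poor", "Good (P300)", "Poor"],
    "Primary task": ["Good", "Variable", "Variable", "Good",
                     "Variable", "Good", "Poor", "Good"],
    "Loading task": ["Good", "Good", "Variable", "Variable",
                     "Good", "Variable", "Good", "Moderate"],
    "Subjective": ["Generally good", "Good", "Good", "Good",
                   "Good", "Good", "Poor", "Good"],
}


def count_good(ratings):
    """Count ratings containing the word 'good' (assumed, simplistic rule)."""
    return sum(1 for rating in ratings if "good" in rating.lower())


tally = {name: count_good(ratings) for name, ratings in TABLE_1.items()}
best = max(tally, key=tally.get)

for name, score in tally.items():
    print(f"{name}: {score} of {len(CRITERIA)} criteria rated good")
print("Highest tally:", best)
```

A simple count like this ignores the relative importance of the criteria, so it should be read only as a summary of the table, not as a weighting scheme.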

Subjective  Workload  Metrics 
Background 

Moray  (1982)  published  a  comprehensive  review  of  subjective  mental  workload 
examining the literature from 1968, when cognitive measures of performance were first
beginning  to  be  examined  by  the  human  factors  community.  He  reports  that  few  studies  had 
been  published  during  that  time,  but  his  analysis  of  those  studies  is  particularly  helpful  for  the 
present  task:  selecting  workload  metrics  for  telemedicine  applications.  This  review  was  divided 
into  four  categories,  of  which,  three  are  relevant  to  the  cognitive  demands  of  telemedicine 
procedures:  cognitive,  manual  control,  and  time  stress  tasks. 

For  the  analysis  of  cognitive  tasks,  a  global  measure  of  subjective  workload  (such 
as “On a scale from 1 to 9, how difficult is this task?”) correlated better than r = 0.90 with task
performance. This result has tended to be substantiated by experimental results in the ensuing
decade when primary task paradigms were tested; however, global subjective workload measures
have been found to dissociate from performance when dual task paradigms or tasks requiring
either overlearned (automated) or complex responses are employed (Wickens & Yei-Yu, 1983;
Vidulich & Wickens, 1986).

Manual control tasks assessed were all flight control tasks. The primary
assessment  tool  was  a  subjective  rating  scale  of  handling  characteristics  called  the  Cooper- 
Harper  (CH)  scale.  The  focus  of  this  review  was  on  the  characteristics  of  the  manual  control 
tasks  that  affected  subjective  workload.  Both  order  of  control  and  display-to-response  lag 
increased  subjective  workload.  This  is  consistent  with  the  performance  literature  which  has 
found  increases  in  error  rates  with  as  little  lag  as  250  msec  in  speech  signals.  The  upper  limit  on 
lag  that  can  be  accommodated  at  all  in  continuous  manual  control  tasks  is  5  seconds. 
Furthermore,  the  requirement  to  complete  concurrent  manual  control  tasks  and  the  introduction 
of  instability  into  the  control  system  also  reliably  increased  workload  ratings  on  the  CH  scale.  A 
medical  analogue  to  this  manual  control  task  occurs  in  laparoscopic  gall  bladder  surgery  when 
more  than  one  manipulator  must  be  controlled  inside  a  patient's  closed  abdomen.  This 
laparoscopic surgery is analogous to teleproctored surgery because the surgeon must view the
operation  indirectly  through  a  display  on  a  color  monitor.  If  remote  transmission  produces  a  lag 
in  the  visual  display  system,  a  major  source  of  documented  workload  will  be  introduced.  The 
CH  scale  has  been  sensitive  to  this  lag. 

Finally,  time  stress  has  been  an  important  driver  of  cognitive  workload  and  is  a 
factor  in  some  medical  procedures  considered  for  telemedicine  intervention  (e.g.,  surgery, 
emergency  room  medicine).  Philipp,  Reiche,  and  Kirchner  (1971)  found  that  the  workload  for 
air  traffic  controllers  who  were  on  duty  for  several  hours  could  be  assessed  using  a  nine-point 
scale  for  two  global  questions:  How  difficult  is  the  task?  and  How  much  time  stress  is  there? 
Objective  measures  of  information  processed  and  time  pressure  for  communication  were 
correlated  with  the  two  subjective  measures.  These  correlations  were  r  =  0.69  and  r  =  0.56, 
respectively,  indicating  a  significant  relationship  between  the  objective  and  the  subjective 
measures.  These  correlation  levels  are  well  within  the  accepted  levels  for  measuring  validity. 
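
The correlations cited above are product-moment coefficients between an objective measure and a subjective rating. A minimal sketch of that computation follows; the function name `pearson_r` and the data values are hypothetical and are not drawn from Philipp, Reiche, and Kirchner (1971).

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: an objective demand level per work period and the
# operator's subjective difficulty rating for the same period.
objective_demand = [1, 2, 3, 4, 5]
subjective_rating = [2, 1, 4, 3, 5]

r = pearson_r(objective_demand, subjective_rating)
print(round(r, 2))  # 0.8
```

With only five hypothetical observations, the value shown says nothing about statistical significance; it only illustrates how such a validity coefficient is obtained.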

This background describes the subjective workload research issues that emerged
along  with  a  revival  of  general  interest  in  cognitive  psychology  approximately  25  years  ago. 
Subsequent  interest  in  this  method  of  assessing  workload  has  resulted  in  a  number  of  tested 
subjective  workload  assessment  techniques.  These  metrics  are  described  next. 

Candidate  Measurement  Tools 

A  number  of  candidate  metrics  have  been  developed  and  tested  (see  Lysaght  et 
al.,  1989,  and  Boff  &  Lincoln,  1988,  for  reviews).  The  present  analysis  targets  metrics  that  may 
be of particular use in the assessment of cognitive workload in telemedicine.

Subjective workload metrics may be divided into two general categories: rating
scales, which provide quantitative measures of subjective workload, and questionnaires and
interviews, which provide qualitative information and lessons learned. Many measures of
subjective  workload  have  been  developed  solely  for  their  application  to  a  single  study  or  to  a 
single  area.  These  measures  are  not  discussed  since  only  measures  that  have  hopes  of 
generalizing  from  other  domains  are  reasonable  candidates  for  evaluating  telemedicine 
applications. Several rating scales have been subjected to test and evaluation development,
have been shown to be valid in previous research, and will be considered. They are described
in the following section:

•  Cooper-Harper  Scale  (including  modified  Cooper-Harper)  is  a  widely  used 
metric  which  was  originally  developed  for  assessing  aircraft-handling  capabilities. 
It  has  been  a  sensitive  indicator  of  workload  for  motor  or  psychomotor  tasks 
(Wierwille  &  Connor,  1983).  A  modified  version  called  the  modified  Cooper- 
Harper  (MCH)  has  been  used  successfully  to  assess  perceptual  and  cognitive 
requirements  (Wierwille  &  Casali,  1983).  One  factor  to  consider  in  using  the  CH 
or  MCH  is  that  it  is  a  rating  scale  which  produces  only  ordinal  scale  data,  thus 
limiting  analysis  of  statistical  significance  to  non-parametric  tests. 

•  NASA-Task  Load  Index  (NASA-TLX)  and  its  subscales  is  a  group  of  six 
scales  reflecting  separate  dimensions  of  workload  and  an  overall  workload  rating 
(Hart  &  Mashkati,  1988).  These  dimensions  include  cognitive  loading  factors 
such  as  time  pressure  and  mental  effort,  as  well  as  physical  factors  such  as 
amount  of  physical  effort.  The  rating  is  a  20-point  scale  which  is  assumed  to  be 
interval. The NASA-TLX has undergone extensive and rigorous theoretical
development  and  evaluation.  Although  the  TLX  has  been  used  most  extensively 
to  evaluate  flight  tasks,  it  has  been  used  to  assess  workload  in  laboratory  tasks 
(e.g.,  short  term  memory,  visual  search,  and  target  acquisition).  It  has  been  found 
to  be  a  valid,  reliable,  and  sensitive  measure  of  cognitive  workload.  The  TLX  is 
preferred  to  the  longer  NASA-bipolar  measure  because  of  the  easier 
administration  of  the  TLX  and  the  failure  to  demonstrate  an  advantage  of  the 
bipolar  version. 

•  Subjective  Workload  Assessment  Technique  (SWAT)  is  a  group  of  three 
scales  reflecting  separate  dimensions  of  workload:  time  pressure,  mental  stress, 
and  effort.  SWAT  has  undergone  extensive  theoretical  development  and  has  been 
evaluated in both aviation and non-aviation environments (e.g., Eggemeier &
Stadler,  1984;  Eggleston,  1984;  Heffley,  1983;  Detro,  1985).  Use  of  conjoint 
measurement  converts  these  subscale  ratings  into  a  single  workload  measure 
which  is  interval  instead  of  ordinal  (Nygren,  1991).  This  metric  has  been  a 
sensitive,  reliable,  and  valid  measure  of  cognitive  workload.  As  currently  used, 
there  are  only  three  rating  levels  for  each  dimension  (subscale).  As  the  result  of 
current  test  and  evaluation  studies  (see  Moroney,  Biers,  &  Eggemeier,  1995;  Biers 
& McInerney, 1988), it may be possible to eliminate the current scaling procedure
necessary  for  conjoint  measurement.  This  scaling  procedure  (called  a  card  sort) 
has  limited  the  number  of  levels  on  each  subscale.  If  the  card  sort  is  not  used, 
increasing  the  levels  on  each  subscale  from  three  to  five  may  improve  sensitivity 
and remove floor and ceiling effects.

•  Psychophysical  Scaling  (e.g.,  magnitude  estimation)  asks  that  operators  report 
the workload imposed by a task in comparison to some other task or standard.
For example, using magnitude estimation, a standard task will be assigned a
numerical  value  and  operators  are  asked  to  compare  a  task's  workload  to  that  of 
the  standard  task  by  assigning  a  numerical  value  to  the  current  task.  Using  paired 
comparisons,  all  tasks  are  paired  and  the  operator  chooses  the  one  of  the  pair  with 
the  higher  subjective  workload  (see  Acton,  Crabtree,  Simons,  Gomer,  &  Eckel, 
1983, for an application). The difficulty with this procedure is that the number of
pairs of tasks, n(n-1)/2, increases too rapidly as the number of tasks themselves
(n) increases. Equal-appearing intervals ask operators to assign tasks to
categories  judged  to  be  of  increasing  difficulty.  The  categories  are  interval  scales. 
Although  extensive  work  has  been  done  in  the  development  of  psychophysical 
scaling  techniques  forjudging  laboratory  stimuli,  little  work  has  been  reported 
from  operational  settings  or  from  workload  measurement.  The  potential  is  there, 
but  it  awaits  further  work  to  determine  its  applicability. 

•  Stockholm  Scales  are  the  result  of  early  work  at  the  University  of  Stockholm  in 
the  development  of  a  univariate  (nondimensional)  measure  of  workload.  This 
measure  was  validated  using  items  on  an  intelligence  test  that  measured  spatial 
ability,  reasoning  ability,  and  verbal  comprehension.  (Note  that  all  these  tasks  are 
processed  in  conscious  or  working  memory  and  hence  should  be  readily  available 
for  subjective  evaluation  by  the  subject.)  The  reliability  and  validity  as  measured 
by this evaluation were very high. An 11-point version of this scale was used to
assess spare mental capacity in a dual task paradigm using laboratory tasks.

These  tasks  were  either  perceptually  demanding  (e.g.,  target  acquisition)  or 
demanding  of  central  processing  capacity.  In  both  cases,  the  Stockholm  Scale 
correlated  well  with  performance  and  secondary  task  measures  of  spare  capacity 
(i.e.,  it  was  sensitive  to  changes  in  the  primary  task  difficulty).  The  scale  is 
designed  to  measure  effort  as  available  spare  central  processing  capacity,  not 
motor  or  psychomotor  control. 

•  Overall  Workload  (OW)  Each  of  the  scales  described  can  be  used  as  an  overall 
workload  measure;  some  (e.g.,  NASA-TLX  and  SWAT)  also  have  subscales  that 
may  allow  diagnostic  analysis  of  the  source  of  the  workload  when  overload 
occurs  (Hendy,  Hamilton,  &  Landry,  1993).  The  initial  focus  of  a  workload 
analysis is to determine whether there is an overload that must be remedied before
the task can be completed safely with a reasonable degree of operator workload.

If  overload  is  found,  further  cognitive  task  analysis  can  be  used  to  evaluate  the 
cause. 
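
The growth in the number of paired comparisons noted under psychophysical scaling can be made concrete with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the task counts are chosen for illustration (56 is the number of puzzle patterns developed later in this report).

```python
from itertools import combinations


def paired_comparison_count(n_tasks: int) -> int:
    """Number of unordered task pairs, n(n-1)/2."""
    return n_tasks * (n_tasks - 1) // 2


def all_pairs(tasks):
    """Every unordered pair an operator would have to judge."""
    return list(combinations(tasks, 2))


if __name__ == "__main__":
    for n in (5, 10, 27, 56):
        print(f"{n} tasks -> {paired_comparison_count(n)} pairs")
```

With 56 tasks an operator would face 1,540 judgments, which illustrates why paired comparison becomes impractical for large task sets.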


Conclusions 

The  three  measurement  scales  that  have  undergone  most  extensive  theoretical 
development  and  are  most  relevant  for  the  present  evaluation  are  the  MCH,  NASA-TLX,  and 
SWAT.  Each  has  been  a  valid  and  reliable  predictor  of  workload  in  several  fields.  Of  the  three, 
MCH  appears  to  be  the  most  likely  to  measure  any  motor  or  psychomotor  components  of  a 
medical  procedure.  NASA-TLX  has  had  less  testing  outside  the  aviation  world  than  has  SWAT, 
but  it  has  been  shown  to  correlate  well  with  SWAT  and  MCH  results  in  those  cases  when  two  or 
more  of  these  metrics  have  been  tested  together  (e.g.,  Vidulich  &  Tsang,  1986;  Warr,  Colle,  & 
Reid,  1986;  see  also  Lysaght  et  al.,  1989,  for  a  summary  review).  SWAT  has  been  a  sensitive 
predictor  of  increasing  task  difficulty,  measuring  increased  workload  before  the  point  that  task 
difficulty  leads  to  a  decrement  in  performance  (Whitaker,  Peters,  &  Garinther,  1989). 

Each  of  these  metrics  has  been  more  or  less  sensitive  to  changes  in  task  difficulty, 
depending  upon  the  domain  in  which  they  have  been  used.  This  domain-specific  aspect  requires 
that  comparisons  be  made  among  these  candidate  measures  to  determine  which  is  the  most 
effective  in  evaluating  cognitive  workload  for  various  telemedicine  procedures. 
The following sections of this report describe the development of a research protocol,
which  can  be  used  to  determine  the  most  sensitive  workload  metric  from  among  the  candidate 
pool. 

METHODOLOGICAL  DEVELOPMENT  AND  EXPERIMENTATION 

A  performance  task  using  puzzle  patterns  was  developed,  and  the  difficulty  level  of  each 
pattern  was  assessed  in  an  experimental  protocol.  A  measure  of  individual  differences  was 
obtained  and  tested,  and  a  research  protocol  was  developed  through  testing  of  the  surrogate  task 
and  telecommunications  equipment.  The  results  of  this  effort  are  described  in  the  present  report. 


Development  and  Assessment  of  Performance  Task:  Puzzle  Patterns 
Rationale 

To  maintain  experimental  control,  as  well  as  for  ethical  and  legal  reasons,  it  was 
not  possible  to  use  an  actual  medical  procedure  in  the  planned  assessment  of  workload  metrics. 
Therefore,  an  alternate  task  that  shared  the  cognitive  demands  of  such  procedures  was  needed. 
The  following  demands  were  considered  to  be  essential  for  a  surrogate  task: 

1. Teamwork: Teamwork between at least two team members is required. In a
telemedicine application, at least one person is located remotely. He or she is
communicating with either another health practitioner or a patient at a distance.

2. Visual-Spatial Requirement: Many telemedicine applications require the
transmission of video images to be evaluated by a remotely located specialist.
Therefore, the task had to incorporate the visual-spatial requirements of those
telemedicine applications.

3. Communication: A communication component was needed because one way
in which face-to-face (also called co-located) conditions and telemedicine
conditions differ is in the need for one health care practitioner to provide
information to a remotely located health care practitioner via audio-video
channels.

4. Performance Demands: The task had to place accuracy and time pressure
constraints on the team members so that the outcome will produce sensitive
performance indicators. In this way, it is possible to assess the correlation
between successful task execution and subjective workload.

5. Psychomotor Component: Many medical procedures have a large
psychomotor component. The ultimate goal for the selected surrogate task is that
it will be useful for assessing workload during medical procedures. Therefore, a
task that has a psychomotor component was needed.

Development  of  a  Surrogate  Task 

Using  this  rationale  as  the  basis  for  selecting  a  surrogate  task,  a  search  for  existing 
normed  and  validated  tasks  was  conducted.  A  potential  match  was  found  in  the  spatial  abilities 
block pattern task of the WAIS-R Intelligence Test (Wechsler, 1981). In this test, a person is
supposed to construct a two-color pattern from blocks that matches the pattern shown on a
display card. The WAIS-R contains five four-block patterns and four nine-block patterns. This
task met each of the four cognitive criteria established for selecting a surrogate task and can be
modified to include a psychomotor component. Furthermore, it has validity in that manipulation
of blocks to form a pattern is used in the training of surgeons for ophthalmic procedures.

Available  Norming  Data 

The  test  manual  claims  that,  during  the  development  of  the  WAIS-R,  performance 
data  were  obtained  to  measure  the  difficulty  of  the  nine  patterns.  However,  these  norming  data 
were not available from either the research department or the legal department of Psychological
Corporation,  despite  repeated  inquiries.  The  following  information  was  available:  (a)  the  earlier 
patterns  are  easier  than  the  later  patterns,  and  (b)  all  four-block  patterns  are  easier  than  all  nine- 
block  patterns.  Therefore,  only  ordinal  scaling  was  assumed  and  the  number  of  difficulty  levels 
was  not  known. 

Creating  Additional  Patterns 

The  design  of  the  experimental  protocol  for  the  application  of  this  surrogate  task 
was  going  to  require  as  many  as  54  different  puzzle  patterns.  WAIS-R  provided  only  nine 
patterns.  Therefore,  it  was  necessary  to  develop  many  additional  patterns.  These  additional 
patterns  were  developed  in  the  following  ways: 
• The original pattern was rotated 90° or 180°. A rotation of 30° is scored as a
different  pattern  in  the  WAIS-R;  therefore,  any  rotation  greater  than  30°  should  be 
discriminable. 

•  The  colors  were  reversed. 

•  A  random  change  was  made  in  one  of  the  original,  rotated,  or  reversed  patterns  to 
generate  a  discriminable  pattern  with  a  similar  appearance. 

Four of the original WAIS-R patterns were used and 11 rotated patterns, five
color reversal patterns, and 36 random alteration patterns were added to the original set to
produce a complete set of 56 puzzle patterns. Each pattern was assigned a letter code ranging
from A through ddd in random order. These 56 patterns are shown in Figure 1.

Assessing  Pattern  Difficulty 

Numerous  scaling  methods  can  be  used  to  assess  perceived  task  difficulty.  Two 
have  been  sensitive,  reliable,  valid,  uncontaminated,  and  manageable:  magnitude  estimation  and 
rank  ordering  (Kling  &  Riggs,  1972).  These  two  methods  and  their  application  to  this 
assessment  are  described  next. 

Magnitude  estimation  asks  the  observer  to  assign  a  number  to  each  item  being 
assessed.  This  number  is  to  reflect  the  level  of  the  variable  being  assessed  (in  this  case,  pattern 
difficulty).  A  range  of  possible  magnitudes  is  given  and  sometimes  an  anchoring  value  is  used, 
although  this  anchor  can  lead  to  distortions.  Magnitude  estimation  can  produce  interval  scaled 
data.  In  this  specific  case,  an  observer  was  shown  a  set  of  cards,  each  showing  one  of  the  27 
four-block  patterns.  The  observer  was  allowed  to  look  at  each  of  the  patterns  and  to  make  any 
comparisons  while  examining  the  set.  Next,  the  experimenter  shuffled  the  cards  and  then 
showed  the  cards  one  at  a  time  and  asked  the  observer  to  assign  a  magnitude  between  1  and  50  to 
each  card.  The  29  nine-block  patterns  were  assessed  in  the  same  way  except  that  the  range  of 
magnitudes  was  51  to  100. 

Rank  Ordering  asks  the  observer  to  place  the  items  in  an  order  of  increasing  value 
on  the  variable  being  assessed.  Rank  ordering  produces  ordinal  scaled  data.  In  this  case,  after 
completing  the  magnitude  estimation  task  for  the  four-block  patterns,  the  experimenter  again 
shuffled  the  cards  and  then  asked  the  observer  to  place  the  cards  in  order  from  the  easiest  to  the 
most  difficult  pattern.  After  completing  the  magnitude  estimation  task  for  the  nine-block 
patterns,  the  observer  rank  ordered  this  set. 
Figure 1. Puzzle patterns developed for experimental paradigm.
BLOCK PATTERN CODES

4 = 4 block pattern          W = WAIS-R original pattern
9 = 9 block pattern          C = color reversal of a WAIS-R pattern
VE = very easy pattern       R = rotated WAIS-R pattern
E = easy pattern             X = pattern created by experimenter
M = moderate pattern         P = pattern used for practice only
D = difficult pattern

 1.) 4-E-C     13.) 4-E-R      25.) 4-E-X     37.) 9-H-X-P    49.) 9-M-X-P
 2.) 4-E-C     14.) 4-E-X      26.) 4-E-X     38.) 9-M-X      50.) 9-M-X
 3.) 4-VE-X    15.) 4-E-R      27.) 4-VE-X    39.) 9-M-X      51.) 9-M-X
 4.) 4-E-X     16.) 4-E-X-P    28.) 9-M-X     40.) 9-M-X      52.) 9-H-W
 5.) 4-E-X     17.) 4-VE-C     29.) 9-M-X     41.) 9-M-X      53.) 9-M-X
 6.) 4-VE-X    18.) 4-VE-W     30.) 9-M-X     42.) 9-M-X      54.) 9-M-X
 7.) 4-VE-X    19.) 4-E-R      31.) 9-H-R     43.) 9-H-R-P    55.) 9-H-R
 8.) 4-VE-W    20.) 4-VE-R-P   32.) 9-H-R     44.) 9-M-X-P    56.) 9-M-X
 9.) 4-E-X     21.) 4-VE-X-P   33.) 9-H-R     45.) 9-M-W
10.) 4-VE-R    22.) 4-VE-X     34.) 9-M-X     46.) 9-H-X-P
11.) 4-VE-X    23.) 4-VE-R     35.) 9-M-R     47.) 9-M-X
12.) 4-VE-C    24.) 4-E-W      36.) 9-M-X     48.) 9-M-X

Figure 1. (continued)

Results 

Eight  independent  observers  were  asked  to  provide  magnitude  estimations  and 
rank orderings of the 56 puzzle patterns. Three types of statistical analyses were conducted:
correlations,  descriptive  statistics,  and  regression.  First,  correlations  were  computed  to  assess  the 
reliability  of  these  judgments  within  raters  (comparing  magnitude  estimation  to  ranking)  and 
between  raters  on  each  scaling  method.  See  Table  2  showing  stem-and-leaf  plots  of  these  three 
reliability  distributions.  Mean  inter-rater  reliability  in  the  range  of  r  =  .80  and  above  is 
considered  to  be  satisfactory  for  testing  instruments  (Guilford,  1956). 

• The intra-rater reliability between magnitude estimations and rank orderings was
assessed using Spearman’s ρ because the rank orderings are ordinal data; ρ
ranged from .88 to .97 with a median of .94.

• Inter-rater reliability for the magnitude estimations was assessed using Pearson’s
r; r ranged from .70 to .97 with a mean of .88.

• Inter-rater reliability for the rank orderings was assessed using Spearman’s ρ; ρ
ranged from .73 to .96 with a median of .87.
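
As a rough illustration of the reliability coefficients described above, the following sketch computes Pearson's r and Spearman's rank correlation (Pearson's r applied to ranks, with tied values given their average rank). The function names and the two short rating lists are hypothetical and are not taken from the study data.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical ratings from one observer for six cards.
    magnitudes = [12, 30, 5, 44, 21, 30]   # magnitude estimates
    rank_order = [2, 4, 1, 6, 3, 5]        # rank ordering of the same cards
    print(round(spearman_rho(magnitudes, rank_order), 3))
```

With eight observers, each inter-rater analysis involves 8(8-1)/2 = 28 pairwise coefficients, which matches the number of leaves in each inter-rater plot of Table 2.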

Table  2 

Stem-and-Leaf  Plots  of  Intra-  and  Inter-Rater  Reliabilities 

(Stem  and  leaf  plots  are  a  method  of  displaying  frequency  distributions  in  a  summary  form  while 
still  retaining  the  individual  data  values.  For  example,  the  individual  r  values  for  the  intra-rater 
reliabilities  are  .88,  .89,  .92,  .94,  .95,  .97,  .97,  .97.) 


Intra-rater reliabilities

.8 | 8 9
.9 | 2 4 5 7 7 7

Inter-rater reliabilities (magnitude estimations)

.7 | 0 9 9
.8 | 0 0 4 4 4 7 8 8 8 8 8 9 9
.9 | 0 1 1 1 2 3 4 4 5 5 5 7

Inter-rater reliabilities (rank orderings)

.7 | 3 6 7 8 8
.8 | 0 1 3 4 5 5 5 6 7 7 7 7 8 8 8 9
.9 | 0 0 1 1 2 4 5
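
Because a stem-and-leaf plot retains the individual values, the summary statistics quoted in the text can be recomputed from Table 2. The sketch below expands the stems and leaves as read from the table; the helper name `expand` is hypothetical, and the leaf strings are a transcription of the plot rather than the original data file.

```python
from statistics import mean, median


def expand(stem_leaf):
    """Turn {stem: 'leaves'} into a sorted list of two-decimal values."""
    values = []
    for stem, leaves in stem_leaf.items():
        for leaf in leaves:
            values.append(round(stem + int(leaf) / 100, 2))
    return sorted(values)


# Leaves transcribed from Table 2 (one character per coefficient).
intra = expand({0.8: "89", 0.9: "245777"})
inter_magnitude = expand({0.7: "099", 0.8: "0044478888899", 0.9: "011123445557"})
inter_rank = expand({0.7: "36788", 0.8: "0134555677778889", 0.9: "0011245"})

if __name__ == "__main__":
    print("intra-rater median:", median(intra))
    print("inter-rater (magnitude) mean:", round(mean(inter_magnitude), 2))
    print("inter-rater (rank) median:", median(inter_rank))
```

Each inter-rater plot expands to 28 coefficients, one per pair of the eight observers.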


Second,  the  mean  and  standard  deviations  (SDs)  of  the  magnitude  estimation  for 
each  card  were  calculated  to  define  the  pattern’s  difficulty  level.  Magnitude  estimations  were 
used because they are interval data, while ranks are only ordinal. When two measures have
similar  reliabilities,  the  interval  measure  allows  more  powerful  statistical  manipulations  (e.g., 
mean  instead  of  median).  Table  3  provides  the  individual  magnitude  estimations  from  each 
observer  and  the  mean  magnitude  estimation.  Table  4  provides  the  individual  rank  orderings  and 
the  median  rank  ordering. 
Table 4

Rank Orderings for Individual Raters and Average (MDN and mean) for Each Puzzle Pattern Card

Card    Ranks (Raters 1-8)               MDN rank   Mean rank

1       27? 20 23 24 14 25               23.00      21.63
2       14 22 12 7 19 13 10 15           13.50      --
3       4 3 5 3 3 3 ? 2                  3.00       --
(4)     25 27 27 24 18 12 27 23          24.50      22.88
5       27 25 15 27 27 27 26 20          26.50      24.25
(6)     16 2 23 9 2 2 18 11              10.00      10.38
7       2 1 4                            1.00       1.63
8       10 10 17                         --         12.88
9       23 19 22 26 24 23 22 12          22.50      21.38
(10)    7 9 5 17 20 19 5 6               --         --
11      4 7 4 4 15 25 ? 10               5.50       --
12      15 6 20 13 ? 4 3                 9.50 (MDN or mean)
13      22 21 25 21 16 15 23 26          21.50      --
14      11 20 8 25 11 24 17 13           --         --
(15)    26 26 22 21 8 24 25              23.00      21.63
(16)    24 9 19 23 26 15 18              --         --
17      5 16 19 16 6 5 9 2               7.50       9.75
18      12 10 11 6 13 16 12 9            11.50      11.13
19      13 ? 14 8 6 19                   15.00 (MDN or mean)
20      9 17 17 9 9 7                    --         11.75
21      6 8 5 12 21 13 24                12.00 (MDN or mean)
22      8 4 6 12 22 20 6 5               7.00       10.38
23      13 11 10 8 14 17 11 8            11.00      11.50
24      19 14 16 15 7 7 20 22            --         15.00
25      24 12 13 18 25 22 14 14          --         17.75
26      26 15 14 26 18 21 19             --         19.88
27      1 3 2 5 3 3 1                    --         3.13
28      39 47 37 41 39 53 36             --         40.25
(29)    38 38 32 35 45 52 34 44          --         --
(30)    35 43 54 ? 33 47 41              41.00 (MDN or mean)
31      49 50 40 54 53 54                49.50      48.88
32      42 56 49 50 48                   48.50      48.25
33      54 49 53 44 30 42 56 39          46.50      45.88
34      56 36 39 49 42 31 39 52          40.50      43.00
(35)    28 31 29 32 37 43 28 31          31.00      32.38
(36)    37 43 33 29 33 51 30 49          35.00      38.13
37      34 46 47 34 54 36 40 53          43.00      43.00
38      44 35 34 46 53 38 33 42          40.00      --
39      55 32 52 43 29 29 54 33          --         --
(40)    41 53 46 55 55 50 48 56          --         --
41      45 44 52 40 47 38 37             --         --
42      47 40 41 42 52 45 51 35          --         --
43      51 51 53 46 46 52 55             --         --
(44)    35 39 48 34 35 46                --         --
(45)    29 30 28 31 38 30 29 38          --         --
(46)    52 54 40 47 50 37 44 51          48.50      --
(47)    51 52 48 37 51 32 45 43          46.50      44.88
48      40 55 42 51 56 56 41 50          --         48.88
49      45 49 56 49 44 43 29             --         45.63

Note. "?" marks a value that is illegible or uncertain in the source, and "--" marks a summary value that does not survive. Card numbers in parentheses are inferred from row order. Rows with fewer than eight ranks list the surviving values in source order. For Cards 50 through 56, only the fragments "32 34 45" and "35 35" survive.
Finally, a regression analysis was calculated to provide a visual representation of
the reliability and the mean trend for the assessed difficulty of these 56 patterns. The analysis
regressed the individual magnitude estimations (as the dependent variable) against the mean
magnitude estimation (as the independent variable). First, regression was computed on the entire
set of 56 cards (including both the four- and the nine-block patterns). The linear regression
equation was Y’ = 1.64X + 4.12, with R² = .82, explaining 82% of the variance (see Figure
2).


Figure  2.  Regression  equation  and  scatter  plot  showing  magnitude  estimates  for  all  56  cards. 

A  portion  of  the  strong  correlation  reflects  the  clear  separation  between  the 
assessed  difficulty  of  the  four-block  and  the  nine-block  patterns.  This  is  consistent  with  the 
information  obtained  from  the  WAIS-R  manual  and  with  the  manner  in  which  magnitude 
estimations  were  assigned  (1  to  50  for  four-block  and  51  to  100  for  nine-block).  Clearly,  two 
levels  of  difficulty  exist  in  the  total  set  of  56  patterns. 

Next,  the  regression  was  calculated  within  each  of  the  two  pattern  sets  (four-  and 
nine-block  patterns): 
• The regression equation for the four-block patterns was Y’ = 1.14X + 7.24, with
R² = .49, explaining 49% of the variance. The F-test of the significance of the explained
variance (greater than using the set of four-block patterns as a single undifferentiated difficulty
level) is F = 182.8, p < .01 (see Figure 3).


Figure  3.  Regression  equation  and  scatter  plot  showing  magnitude  estimates  for  the  four-card 
patterns. 


• The regression equation for the nine-block patterns was Y’ = .58X + 52.32, with
R² = .20, explaining 20% of the variance. The F-test of the significance of the explained
variance (greater than using the set of nine-block patterns as a single undifferentiated difficulty
level) is F = 59.9, p < .01 (see Figure 4).
Figure 4. Regression equation and scatter plot showing magnitude estimates for the nine-card
patterns.
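
The regression analyses above can be sketched as an ordinary least-squares fit of individual magnitude estimates on the card means. The code below is a minimal illustration under stated assumptions: the function names and the ratings are hypothetical, and the F statistic is assumed to be the usual overall test for a single-predictor regression, with 1 and n - 2 degrees of freedom. It does not reproduce the coefficients reported in the text.

```python
def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept; returns slope, intercept, R^2."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_total = sum((b - my) ** 2 for b in y)
    ss_resid = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    return slope, intercept, 1 - ss_resid / ss_total


def f_statistic(r_squared, n):
    """Overall F for one predictor, with 1 and n - 2 degrees of freedom."""
    return r_squared / ((1 - r_squared) / (n - 2))


# Hypothetical magnitude estimates: three raters, four cards.
ratings = {
    "card_a": [10, 14, 12],
    "card_b": [20, 26, 23],
    "card_c": [30, 41, 34],
    "card_d": [45, 48, 42],
}

# Pair every individual estimate (y) with the mean estimate for its card (x).
x, y = [], []
for estimates in ratings.values():
    card_mean = sum(estimates) / len(estimates)
    for estimate in estimates:
        x.append(card_mean)
        y.append(estimate)

slope, intercept, r_squared = fit_line(x, y)

if __name__ == "__main__":
    print(f"R^2 = {r_squared:.2f}, F(1, {len(x) - 2}) = {f_statistic(r_squared, len(x)):.1f}")
```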


The  significant  F  value  for  each  regression  analysis  indicates  that  there  is  a 
reliable  change  in  the  difficulty  level  within  both  the  four-block  and  the  nine-block  patterns  as 
well as between the two pattern sets. These results are consistent with using the 56 patterns at
four  separate  levels  of  difficulty.  There  may  be  more  separable  difficulty  levels,  but  four  will  be 
a  practical  number  to  test  the  workload  metrics  in  the  proposed  research.  A  quartile  split  was 
used  to  define  four  pattern  difficulty  levels:  very  easy,  easy,  moderate,  and  difficult. 
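
One way to carry out a quartile split of this kind is to sort the cards by mean magnitude estimate and cut the ordered list into four groups of equal size. The sketch below assumes that approach; the function name and the mean estimates are hypothetical, and the report does not state how ties or uneven group sizes were handled.

```python
LABELS = ["very easy", "easy", "moderate", "difficult"]


def quartile_levels(mean_estimates):
    """Map each card to a difficulty label by splitting the sorted cards into four groups.

    Groups are equal in size when the number of cards is a multiple of four;
    otherwise they differ by at most one card.
    """
    ordered = sorted(mean_estimates, key=mean_estimates.get)
    n = len(ordered)
    return {card: LABELS[position * 4 // n] for position, card in enumerate(ordered)}


if __name__ == "__main__":
    # Hypothetical mean magnitude estimates for eight cards.
    means = {"A": 4.5, "B": 12.0, "C": 21.3, "D": 33.8,
             "E": 55.1, "F": 63.4, "G": 78.9, "H": 91.2}
    for card, level in quartile_levels(means).items():
        print(card, level)
```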
Assessing Spatial Ability: Cognitive Laterality Battery (CLB)

Rationale  for  Measuring  Spatial  Ability 

Individual differences in subjects’ ability to perform various tasks can
cloud the results obtained in experimental research. Therefore, investigators control this
possibility either by holding individual difference variables constant or by measuring such variables
and stratifying the sample to allow the measurement of their impact. In the present study, two of
these  variables  might  be  verbal  intelligence  and  spatial  ability.  Intelligence  within  either  a 
sample  of  university  students  or  a  sample  of  medical  personnel  is  not  likely  to  vary  greatly. 
Selection  into  these  populations  has  already  greatly  restricted  the  range  since  verbal  intelligence 
is  highly  correlated  with  academic  success.  However,  spatial  ability  may  range  widely  within 
either  population  because  it  is  not  so  directly  correlated  with  any  selection  procedure  for 
academic  success.  Hence,  a  measure  of  spatial  ability  was  sought  with  which  to  stratify  the 
subjects  in  our  proposed  experiment.  By  this  means,  it  would  be  possible  to  determine  whether 
spatial  ability  was  a  variable  affecting  performance  of  the  task  in  general,  or  interacting  with 
either  task  difficulty  level  or  communication  method  (co-location  versus  telemedicine). 

Selecting  Spatial  Ability  Instrument 

A  spatial  ability  test  battery  called  the  Cognitive  Laterality  Battery  has  been 
developed,  validated,  and  normed  by  Gordon  (1987).  The  entire  test  is  a  cognitive  laterality 
battery  intended  to  determine  the  specialized  functioning  in  each  cerebral  hemisphere.  Of 
interest  for  the  present  research  are  the  four  subscales  (tests)  that  comprise  the  measurement  of 
spatial  ability.  These  four  tests  are  called  localization,  orientation,  form  completion,  and 
touching  blocks.  Each  measures  some  aspect  of  spatial  ability,  and  collectively,  they  provide  a 
reliable  measure  of  this  ability. 

Cognitive  Laterality  Battery  Test  Materials 

The  CLB  is  available  commercially  in  a  package  that  includes  all  administration 
instructions,  stimulus  materials  in  the  form  of  slides  and  taped  instructions,  data  sheet  templates, 
and  scoring  instructions  and  answer  keys.  In  addition,  the  norming  data  (means,  standard 
deviation,  and  frequency  distributions)  for  several  populations  are  provided. 

Equipment  to  Administer  the  Cognitive  Laterality  Battery 

To  administer  the  four  spatial  ability  tests,  the  following  equipment  is  used:  slide 
projector (Kodak Carousel 5400) and tape recorder-player (General Electric #3-5622A). The
administration  of  the  tests,  including  their  instructions  and  material  distribution,  requires 
approximately  60  minutes;  of  this  time,  30  minutes  are  required  for  actual  data  collection  (time 
spent  viewing  the  stimuli  and  marking  responses).  Subjects  can  be  tested  in  groups  of  as  many 
as  10  people,  depending  upon  the  viewing  conditions.  It  is  necessary  for  each  subject  to  be  able 
to  see  clearly  the  stimulus  slides  projected  on  a  screen. 

Spatial  Ability  Subscales 

The  four  spatial  ability  subscales  (i.e.,  localization,  orientation,  form  completion, 
and  touching  blocks)  are  described  next: 

•  Localization  is  a  test  of  the  observer’s  ability  to  reproduce  the  location  of  an  x 
marked  on  a  projected  slide  by  marking  its  corresponding  location  on  a  paper 
template.  There  are  24  slides. 

•  Orientation  is  a  mental  rotation  task.  Observers  view  three  3D  geometric  figures 
and  determine  which  two  figures  are  actually  the  same  object.  There  are  24  tasks. 

•  Form  Completion  consists  of  line  drawings  of  common  figures  with  portions  of 
the  line  segments  erased  (missing).  The  observer’s  task  is  to  name  the  figure. 
There  are  24  figures. 

•  Touching  Blocks  shows  a  stack  of  blocks  in  which  some  blocks  are  numbered. 

The  observer’s  task  is  to  count  the  number  of  blocks  touching  all  the  numbered 
blocks.  There  are  six  stacks. 

Code  Results 

The results are scored by referring to the answer key for each test, except the
localization subscale. The localization subscale requires that the experimenter score the distance in
millimeters that the observer’s response is from the target location. This is a time-consuming
scoring procedure, even using the template provided in the test booklet.

Tabulate  Results 

The  results  can  be  used  as  subscale  values  so  that  they  can  be  compared  to  the 
adult  norming  values  for  each  subscale  in  the  CLB  manual.  Alternatively,  a  general  spatial 
abilities  score  can  be  obtained  by  adding  all  the  subscale  scores  for  a  given  subject.  The  score 
for  the  localization  subscale  is  an  error  measurement  and  hence  is  negatively  correlated  with 
spatial  ability.  Therefore,  the  actual  localization  score  can  be  subtracted  from  any  constant  larger 
than the largest error score in the sample. This transformed score will then be positively
correlated  with  spatial  ability  and  can  be  added  to  the  remaining  three  subscale  scores  to  obtain  a 
total  spatial  ability  measure  for  each  subject. 
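
The localization transformation and the summed score can be sketched as follows. The subject identifiers, dictionary keys, and scores are hypothetical, and the default constant (the largest error in the sample plus one) is only one admissible choice, since any constant larger than the largest error score satisfies the description above.

```python
def total_spatial_scores(subjects, constant=None):
    """Return {subject: total spatial ability score}.

    The localization error (mm) is reflected by subtracting it from a constant
    larger than the largest error in the sample, then added to the other three
    subscale scores.
    """
    largest_error = max(s["localization_error"] for s in subjects.values())
    if constant is None:
        constant = largest_error + 1  # assumed default; any larger constant works
    if constant <= largest_error:
        raise ValueError("constant must exceed the largest localization error")
    totals = {}
    for name, s in subjects.items():
        reflected = constant - s["localization_error"]
        totals[name] = (reflected + s["orientation"]
                        + s["form_completion"] + s["touching_blocks"])
    return totals


if __name__ == "__main__":
    # Hypothetical subscale scores for two subjects.
    sample = {
        "S01": {"localization_error": 42.0, "orientation": 18,
                "form_completion": 15, "touching_blocks": 20},
        "S02": {"localization_error": 65.5, "orientation": 12,
                "form_completion": 17, "touching_blocks": 14},
    }
    print(total_spatial_scores(sample))
```

Because the subscales are summed as raw scores in this sketch, the subscale with the largest numeric range contributes most to the total.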

Application  for  Proposed  Testing 

The  four  spatial  ability  subscales  of  the  Cognitive  Laterality  Battery  are  available, 
reliable,  validated,  uncontaminated,  and  manageable  methods  of  measuring  spatial  ability.  The 
CLB  is  recommended  as  a  satisfactory  method  of  stratifying  spatial  ability. 

Developing  Test  Paradigm  Procedures 

The  final  activity  in  completing  testing  of  this  paradigm  was  to  design  and  test  the 
research  protocol  itself.  A  generic  workload  measure  was  sought,  which  will  assess  the 
cognitive  requirements  that  are  likely  to  exist  in  most  medical  procedures.  Furthermore,  there  is 
a  specific  interest  in  targeting  the  changes  in  cognitive  workload  that  occur  with  the  introduction 
of  telecommunication  for  those  procedures.  Hence,  a  research  protocol  to  test  the  interaction  of 
three variables was designed. The three variables are type of workload metric, task difficulty,
and  communication  condition.  A  measure  for  stratifying  subjects  by  spatial  ability  was  included. 

Workload  Metric 

The  three  candidate  workload  measures  selected  were  the  Subjective  Workload 
Assessment  Technique  (SWAT)  (Reid  &  Nygren,  1988),  the  NASA-Task  Load  Index  (TLX) 
(Hart  &  Mashkati,  1988),  and  the  Modified  Cooper-Harper  (MCH)  (Boff  &  Lincoln,  1988). 

Task  Difficulty 

A surrogate puzzle pattern task was developed as described earlier. The
magnitude  estimations  of  difficulty  were  used  to  produce  four  separate  levels  of  task  difficulty 
which  will  be  used  to  assess  the  sensitivity  of  the  three  workload  metrics.  Thirteen  patterns  of 
each  difficulty  level  were  designed  and  tested. 

Communication  Condition 

The  two  communication  conditions  are  co-location  and  telecommunication.  In 
the  co-location  condition,  the  two  team  members  are  located  in  the  same  room  and  view  the 
working area directly. In the telecommunication condition, they are located in separate rooms
and  have  to  communicate  via  video  and  audio  communication. 
Spatial Ability

The  four  types  of  spatial  ability  teams  are  constructed  by  using  the  subjects’ 
scores  on  the  spatial  ability  subscales  of  the  Cognitive  Laterality  Battery.  The  four  types  of 
teams are high:high, low:high, high:low, low:low, in which the first member is the instructor and
the  second  is  the  builder. 

Equipment 

To  test  the  feasibility  of  the  anticipated  experiment,  it  was  necessary  to  determine 
the  telecommunication  equipment  that  would  be  used  in  the  experiment.  The  major  video 
components of this equipment were obtained as a loan from the U.S. Army Research Laboratory
at  Aberdeen  Proving  Ground,  Maryland.  These  consisted  of  two  video  cameras  (Panasonic  VHS 
AG  160  Proline  camcorder  and  AC  adapter)  and  two  television  monitors  (19-inch  Zenith  Model 
No.  L1912W).  Additional  equipment,  which  was  obtained  from  local  sources,  consisted  of  two 
TRC-512, 49-MHz  FM  Radio  Shack  wireless  transmitter-receivers  (“walkie-talkies”)  to  permit 
audio  communication  between  the  team  members  in  the  telemedicine  condition,  a  RST-84V 
Radio  Shack  tripod,  and  a  25-foot  coaxial  cable  to  connect  the  remote  monitor  to  the  camcorder. 
See  Figure  5  for  diagram  of  the  equipment  setup. 

Design 

To  select  the  best  workload  metric  for  use  in  evaluating  telemedicine  applications, 
the  following  mixed  factors  design  with  three  independent  variables  was  developed.  Spatial 
ability  of  teams  is  varied  at  four  levels:  high:high,  high:low,  low:high,  and  low:low.  The 
remaining  variables  are  both  repeated  measures:  communication  condition  (co-located  versus 
telecommunication)  and  four  levels  of  puzzle  pattern  difficulty  (very  easy,  easy,  moderate,  and 
difficult).  Each  level  of  puzzle  difficulty  occurs  on  a  total  of  12  trials.  Half  of  these  are  in  the 
co-located  and  half  in  the  telecommunication  condition.  In  each  communication  condition, 
workload  for  two  of  the  trials  is  assessed  using  each  of  the  three  workload  metrics  (SWAT, 
NASA-TLX,  and  MCH).  Thus,  a  total  of  48  trials  (puzzle  patterns)  are  completed  by  each  team. 
A  diagram  of  this  mixed  factors  design  is  as  follows:  4  spatial  ability  x  (2  communication  x  4 
task  difficulty  x  3  workload  metrics  x  2  replications).  All  levels  of  the  repeated  measures 
variables  will  be  counterbalanced  or  randomized  to  avoid  confounding  order  with  experimental 
treatment  results. 
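
The trial count implied by this design can be checked with a short enumeration. The sketch below is an assumption-laden illustration: the variable names are hypothetical, and a simple random shuffle of blocks stands in for whatever counterbalancing or randomization scheme is finally adopted (it ignores, for example, the practical need to group trials by communication condition).

```python
import itertools
import random

# Levels of the repeated measures variables, as named in the design.
COMMUNICATION = ["co-located", "telecommunication"]
DIFFICULTY = ["very easy", "easy", "moderate", "difficult"]
METRIC = ["SWAT", "NASA-TLX", "MCH"]
REPLICATIONS = 2  # two trials per block, followed by one workload rating


def build_blocks(seed=None):
    """Return the within-team cells (one two-trial block each) in random order."""
    rng = random.Random(seed)
    blocks = list(itertools.product(COMMUNICATION, DIFFICULTY, METRIC))
    rng.shuffle(blocks)  # assumption: plain randomization of block order
    return blocks


blocks = build_blocks(seed=1)

# Expand each block into its replications to obtain the individual trials.
trials = [
    (comm, diff, metric, rep)
    for (comm, diff, metric) in blocks
    for rep in range(1, REPLICATIONS + 1)
]

print(len(blocks), "blocks,", len(trials), "trials per team")
```

Each team therefore completes 24 blocks (2 x 4 x 3) and 48 trials, with 12 trials at each difficulty level, 6 of them in each communication condition.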

Figure  5.  Diagrams  of  equipment  setup  (monitor  connections  and  camcorder  connections). 

Test  Procedure  and  Modify  Iteratively 

The  actual  procedure  for  the  experimental  paradigm  required  modification  from 
its  conceptualization  to  its  final  form.  This  iteration  was  accomplished  by  the  principal 
investigator  and  the  research  assistant  alternately  serving  as  experimenter  and  subject  or  both 
as  team  members  until  the  procedure,  instructions,  equipment,  training,  and  measurement  issues 
had  been  satisfactorily  developed.  The  following  parameters  were  established  empirically  during 
these  iterative  modifications: 

•  Number  of  training  trials 

•  Preview  time  for  patterns 

•  Audio  communication  equipment 

•  Field  of  view  and  camera  angle 

•  Permissible  puzzle  patterns  constrained  by  video  view 

•  Instructions  to  team  members 

•  Method  of  recording  errors  (sketch) 

•  Anticipated  number  of  errors  (which  influenced  the  design  of  the  dependent  variables) 

•  Time  allowed  for  pattern  building 

•  Power  for  obtaining  workload  measures  (by  two-trial  blocks,  not  for  each  trial.) 

Procedure 

Subjects  are  introduced  to  the  experimental  room  and  the  communication 
equipment.  They  are  told  that  their  task  is  to  work  together  as  teams  to  build  a  series  of  puzzle 
patterns  from  blocks.  Before  data  collection  begins,  each  team  completes  seven  practice  trials  in 
which  they  become  familiar  with  one  another’s  terminology  and  typical  strategies. 

In  the  telecommunication  condition,  one  team  member,  serving  as  the  instructor, 
sits  in  the  room  with  the  television  monitor  (Room  I)  and  the  other,  serving  as  the  builder,  in  the 
room  with  the  camcorder  (Room  B).  The  instructor  has  a  stack  of  24  patterns.  The  instructor’s 
task  is  to  describe  how  to  build  a  given  pattern.  The  builder  has  the  blocks  on  the  table.  The 
builder’s  task  is  to  build  the  pattern  that  is  described.  Both  subjects  view  the  two  patterns  for  a 
given  condition  (e.g.,  moderate  difficulty,  telecommunication,  MCH)  for  10  seconds.  After  the 
preview,  the  instructor  is  the  only  one  to  see  the  paper  pattern.  The  measure  of  time  begins  when 
the  experimenter  says  “begin”  for  each  trial.  It  ends  when  the  instructor  signals  completion. 

After  both  trials  are  completed  in  a  given  condition,  a  workload  rating  is  obtained. 

A  similar  procedure  is  used  for  the  co-location  condition  except  that  the  two  team 
members  are  in  the  same  room.  Again,  the  builder  is  the  only  team  member  allowed  to  touch  the 
blocks  and  the  instructor  is  the  only  one  to  see  the  paper  pattern.  Time  to  completion,  errors  in 
pattern  built  (including  the  sketch  of  any  incorrect  result),  and  workload  ratings  are  recorded  as 
the  dependent  variables. 
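
One way to organize these dependent variables for analysis is a record per two-trial block. The following is a minimal sketch; the class and field names are hypothetical, and the times shown are invented values rather than data from the study.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Trial:
    completion_time_s: float             # from "begin" to the instructor's completion signal
    errors: int                          # number of errors in the pattern built
    error_sketch: Optional[str] = None   # reference to the sketch of an incorrect result


@dataclass
class Block:
    communication: str                   # "co-location" or "telecommunication"
    difficulty: str                      # "very easy", "easy", "moderate", or "difficult"
    metric: str                          # "SWAT", "NASA-TLX", or "MCH"
    trials: List[Trial] = field(default_factory=list)
    workload_rating: Optional[float] = None  # one rating per block, taken after both trials

    def mean_time(self) -> float:
        """Mean completion time over the trials in this block."""
        return sum(t.completion_time_s for t in self.trials) / len(self.trials)


# Hypothetical block, for illustration only.
block = Block("telecommunication", "moderate", "MCH")
block.trials.append(Trial(completion_time_s=40.0, errors=0))
block.trials.append(Trial(completion_time_s=38.0, errors=1, error_sketch="sketch-01"))
block.workload_rating = 30.0

print(block.mean_time())
```

Storing the workload rating on the block rather than on each trial mirrors the procedure, in which a rating is obtained only after both trials in a condition are completed.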

The  instructions  for  the  instructor  and  the  builder  team  members  in  both  the 
telecommunication  and  the  co-location  conditions  are  given  in  Table  5. 

Complete  Protocol  With  a  Sample  Team 

After  the  procedural  modifications  were  completed,  the  entire  protocol  (omitting  the 
CLB)  was  completed  using  two  graduate  students  as  subjects.  The  procedure  required  2  hours  to 
complete  all  seven  training  trials  and  48  data  collection  trials.  The  results  for  this  team  are 
summarized  in  Table  6. 

On  the  basis  of  these  data,  two  further  changes  were  made  in  the  protocol: 

•  The  experimenter’s  procedure  checklist  was  changed  to  make  it  easier  to  collect 
the  results  without  the  errors  that  led  to  the  loss  of  the  NASA-TLX  data  in  the 
sample  run. 

•  Debriefing  questions  were  added  to  collect  information  systematically  about  the 
subject’s  preferences  for  one  or  another  of  the  workload  measures. 

Conclusions 

The  work  described  in  this  paper  was  undertaken  to  establish  a  research  paradigm  for 
developing  a  satisfactory  evaluation  tool  for  telemedicine  applications.  These  efforts  were 
successful  in  establishing  the  feasibility  of  that  research.  A  surrogate  task  (team  building  of 
block  patterns)  was  developed  and  56  patterns  of  measured  difficulty  were  designed,  produced, 
and  tested.  This  task  can  be  used  as  it  is  and  can  be  modified  to  incorporate  a  greater 
psychomotor  component,  when  such  a  component  proves  necessary  in  some  experiments.  The 
telecommunication  equipment  necessary  to  conduct  the  research  was  acquired,  set  up,  and  tested. 
The  possibility  of  uncontrolled  individual  differences  in  spatial  ability  was  considered  for  some 
populations  and  a  measure  of  spatial  ability  was  determined  so  that  teams  can  be  stratified  on  this 
measure.  A  scientifically  sound  research  design  and  a  procedure  for  implementing  that  design 
were  developed.  The  entire  procedure  was  tested  and  final  adjustments  were  made. 

Table  5 


Instructions  for  Builder  and  Instructor  in  Telecommunication  and  Collocation  Conditions 


Telecommunication  Condition: 

Instructions  for  the  builder:  In  this  portion  of  the  study  you  will  receive  a  set  of  instructions  given  to  you  by 
your  team  mate  located  in  another  room.  You  will  hear  these  instructions  over  your  walkie-talkie.  You  will  place  these 
red  and  white  blocks  as  you  are  told  to  form  one  of  the  three  patterns  you  have  viewed.  The  blocks  consist  of  two  red 
sides,  two  white  sides,  and  two  sides  split  in  half  so  that  they  are  both  red  and  white.  This  camera  is  here  so  that  your 
team  mate  may  monitor  your  progress  and  correct  any  mistakes  you  may  make.  Some  patterns  will  seem  harder  than 
others.  After  completing  three  patterns,  you  will  be  asked  to  fill  out  a  form  that  describes  the  amount  of  work  you  think 
was  involved  in  completing  the  previously  built  patterns.  This  is  a  subjective  measure  and  will  not  be  the  same  for  all 
people  so  do  not  feel  as  though  your  ratings  must  meet  a  set  standard.  After  you  have  completed  the  measure  of 
workload,  you  will  build  three  more  patterns  and  fill  out  another  workload  evaluation  and  so  on  until  all  patterns  are 
completed  (there  are  twenty-seven).  Your  goal  is  to  work  as  quickly  as  possible  while  attempting  to  build  a  complete 
correct  pattern.  Your  team  will  receive  a  twenty-five-dollar  reward  if  it  is  one  of  the  two  fastest  teams  with  the  fewest 
errors.  You  may  ask  your  team  mate  to  repeat  any  instructions  you  do  not  understand  by  depressing  the  talk  button  on 
your  own  walkie-talkie.  Are  there  any  questions? 

Instructions  for  the  person  with  the  patterns:  In  this  portion  of  the  study  you  will  be  asked  to  describe  these 
patterns  you  see  before  you  now  to  your  team  mate  located  in  another  room.  Your  team  mate  has  a  set  of  blocks  in  order 
to  achieve  this  construction  which  have  two  red  sides,  two  white  sides,  and  two  sides  that  are  split  in  half  so  that  they  are 
both  red  and  white.  You  will  communicate  to  your  team  mate  via  a  set  of  walkie-talkies,  one  of  which  you  see  before 
you.  You  talk  by  depressing  the  talk  button  for  the  duration  of  the  time  you  need  to  speak.  Your  team  mate  has  the 
option  of  asking  you  to  repeat  any  instructions  s/he  does  not  understand.  Keep  in  mind  your  team  mate  has  viewed  the 
patterns  you  are  describing  for  thirty  seconds  for  nine-block  patterns  and  fifteen  seconds  for  four-block  patterns.  The 
television  monitor  is  here  so  that  you  may  monitor  your  team  mate’s  progress  and  correct  any  errors  s/he  may  make. 
Once  you  have  explained  three  patterns  you  will  be  asked  to  fill  out  a  workload  evaluation  which  will  let  the 
experimenter  know  how  much  work  you  believe  was  involved  in  completing  this  phase  of  the  experiment.  When  this 
evaluation  is  completed,  you  will  describe  three  more  patterns  and  receive  another  evaluation  and  so  on  until  all  patterns 
are  completed  (there  are  twenty-seven).  Workload  evaluations  are  subjective;  therefore  your  opinions  may  or  may  not 
match  someone  else’s.  Do  not  worry;  you  are  not  trying  to  meet  a  standard,  just  state  your  own  opinion.  Your  goal  is  to 
complete  these  patterns  as  quickly  as  possible,  making  as  few  errors  as  possible.  At  the  end  of  the  experiment,  the  two 
teams  with  the  fastest  times  and  the  fewest  errors  will  receive  a  twenty-five-dollar  bonus.  Are  there  any  questions? 

Collocated  Condition: 

Instructions  for  both  subjects:  In  this  phase  of  the  study  you  will  be  asked  to  construct  the  patterns  you  see 
before  you.  Only  one  of  you  will  have  access  to  the  patterns  while  the  other  will  have  the  blocks.  However,  you  both 
will  be  permitted  to  view  the  three  patterns  occurring  in  the  ensuing  block.  If  the  patterns  contain  nine  blocks,  you  will 
be  allowed  to  view  them  for  thirty  seconds  and  if  there  are  four  blocks,  you  may  view  them  for  fifteen  seconds.  Only 
one  designated  person  may  touch  the  blocks.  The  person  with  the  designs  must  describe  to  the  other  person  how  to 
situate  the  blocks  in  order  to  create  the  pattern  s/he  sees.  Each  block  consists  of  two  red  sides,  two  white  sides,  and  two 
sides  split  in  half  so  that  they  are  both  red  and  white.  The  builder  may  at  any  time  ask  the  instructor  to  repeat 
instructions  that  were  not  understood;  however,  the  builder  may  not  ask  to  see  the  design  itself  nor  may  the  instructor 
show  the  design  to  his  or  her  team  mate.  After  completing  three  designs,  you  will  both  be  asked  to  fill  out  a  workload 
evaluation  which  will  tell  the  experimenter  how  much  work  you  each  feel  was  involved  in  completing  this  phase  of  the 
experiment.  These  evaluations  are  subjective  so  the  evaluations  you  both  fill  out  may  not  reflect  the  same  ideas.  Do  not 
worry  about  matching  your  partner’s  evaluation;  the  experimenter  wants  to  know  what  each  of  your  personal  views  are. 
When  this  evaluation  is  completed,  you  will  be  asked  to  complete  three  more  patterns  and  give  another  evaluation  and  so 
on  until  all  patterns  are  completed  (there  are  twenty-seven).  Your  goal  is  to  complete  the  patterns  as  quickly  as  possible 
while  making  as  few  errors  as  possible.  At  the  end  of  the  experiment,  the  two  teams  with  the  fastest  times  and  the  fewest 
errors  will  receive  a  twenty-five-dollar  reward.  Are  there  any  questions? 

Table  6 


Sample  of  One  Team’s  Performance  of  Experimental  Protocol 


Collocation 

Pattern difficulty     Very easy     Easy      Moderate     Difficult 

Time (in sec.)          9             9.8      24           25.8 
SWAT                   33            33        72           72 
NASA-TLX               Lost:  Experimenter  error 
MCH                    30            20        50           60 

Telecommunication 

Pattern difficulty     Very easy     Easy      Moderate     Difficult 

Time (in sec.)         15.8          27.67     39.17        35.17 
SWAT                   33            44        61           61 
NASA-TLX               Lost:  Experimenter  error 
MCH                    20            30        30           70 

NOTE:  WL  adjusted  to  0  to  100  range 
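
The completion times in Table 6 can be compared across communication conditions with a few lines of arithmetic. This sketch uses only the times reported in the table; the variable names are hypothetical, and because the data come from a single sample team, the comparison is illustrative rather than inferential.

```python
DIFFICULTY = ["very easy", "easy", "moderate", "difficult"]

# Completion times (in seconds) copied from Table 6.
TIME_S = {
    "collocation": [9, 9.8, 24, 25.8],
    "telecommunication": [15.8, 27.67, 39.17, 35.17],
}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


for condition, times in TIME_S.items():
    print(f"{condition}: mean time {mean(times):.2f} s")

# Extra time taken under telecommunication at each difficulty level.
differences = [
    round(tele - coll, 2)
    for coll, tele in zip(TIME_S["collocation"], TIME_S["telecommunication"])
]
for level, diff in zip(DIFFICULTY, differences):
    print(f"{level}: +{diff} s")
```

For this team, the telecommunication time exceeded the collocation time at every difficulty level.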